{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "8630f18b",
   "metadata": {},
   "source": [
    "\n",
    "<h2 align=\"left\"><span style=\"color:#44546A\">Sepsis dataset</span></font></h2>\n",
    "<h3 align=\"left\"><span style=\"color:#44546A\">source of information: discharge summaries</span></font></h3>\n",
    "\n",
    "<br>\n",
    "Hospitalizations with sepsis, defined as those with a diagnosis recorded in the discharge summary at the end of hospitalization and coded as sepsis, according to the 10th version of the International Classification of Diseases (ICD-10). These are, therefore, cases explicitly classified by physicians. Implicit cases of sepsis were not included (e.g., diagnoses of infection plus codes for organ dysfunction). A single patient may have more than one hospitalization with this diagnosis. This initial group contained 1,757 patients and 1,940 hospitalizations.<br>\n",
    "The text of each selected hospitalization was analyzed, selecting the best descriptions from the discharge summary. The criteria used were: presence of descriptions of the history of the present illness, physical examination, laboratory and/or imaging tests, and treatment performed during hospitalization, necessarily including the use of antibiotic therapy and the diagnosis by ICD-10 (suspected or confirmed) of sepsis. Therefore, this group underwent a second manual selection by a physician \n",
    "<br>\n",
    "The priority in handling the data from this dataset was maintaining patient confidentiality; further details can be found in the article correlated with this notebook. The diagnostic information, contained in the medical record, was extracted and reviewed, and in specific cases (e.g., rare diseases) some diagnoses were excluded.\n",
    "<br>\n",
    "\n",
    "\n",
    "Data Description:<br><br>\n",
    "01 - `cod_paciente`: The patient code is the unique identifier for each patient within the hospital. In this dataset, the registration number was replaced with a sequential number. Thus, the same patient received different identification numbers, which we consider important to maintain patient de-identification.<br><br>\n",
    "02 - `seq_atendimento`: The service sequence corresponds to a key that filters all records of a specific care (e.g., hospitalization) for each patient  in the electronic medical record. This key was deleted to de-identify the hospitalization and because, in text processing, it corresponds to the sequential number that replaced the patient code.<br><br>\n",
    "03 - `profissao`: It refers to the type of professional who made the record in the patient's electronic medical record, such as a doctor, nurse, speech therapist, etc. Only doctors who treated the patient can record the discharge summary; therefore, this record is unnecessary and was deleted.<br><br>\n",
    "04 - `ctu_informacao`: This field corresponds to the patient's complete medical record, containing details of all treatments and services received at the hospital's units. For this particular dataset, it specifically contains the patient's discharge summary.<br><br>\n",
    "05 - `ctu_informacao_len`: the number of words in the text registered in ctu_informacao before processing.<br><br>\n",
    "06 - `pep_anotado_p94r91/model-best`: a NER model trained by author using Spacy and patient medical records.<br><br>\n",
    "07 - `abreviaturas.csv`: a dictionary of medical abbreviations was created comprising a total of 309 abbreviations and their respective meanings. The dictionary was used to replace the abbreviations with full words. Ambiguous abbreviations such as \"pcr\" (which can mean polymerase chain reaction or cardiorespiratory arrest in Portuguese) were not replaced.<br><br>\n",
    "08 - `complementary information`- Supplementary information from diverse sources within the patient's medical record was compiled and added to the dataset.  This automatic process involved external extraction, processing, and handling of missing data, resulting in a separate file.  Records from this file were linked to the main dataset where the `seq_atendimento` values were identical. The supplementary data fields are: total length of stay, number of medical specialties consulted, ICU usage, palliative care status, and discharge type<br><br>\n",
    "\n",
    "\n",
    "#### Flow of the notebook\n",
    "\n",
    "1. [Import an unprocessed sepsis dataset](#section01)\n",
    "2. [Extract diagnosis - manually](#section02)\n",
    "3. [Health record text processing](#section03)\n",
    "4. [Process medical abreviations](#section04)\n",
    "5. [Incorporating complementary information into the dataset](#section05)\n",
    "6. [Descriptive statistics](#section06)\n",
    "\n",
    "\n",
    "\n",
    "#### OtherDetails\n",
    "\n",
    "<br>**the final dataframe `dataset_desidentificado` was exported to a csv file and manually analysed by a physician to correct errors or improve information. Some information was manually deleted to maintain patient confidentiality**<br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "77c3e116",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from io import StringIO\n",
    "from tqdm import tqdm\n",
    "tqdm.pandas()\n",
    "import re\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc73c17b",
   "metadata": {},
   "source": [
    "<a id='section01'></a>\n",
    "<h3 align=\"left\"><span style=\"color:#44546A\">1. Import an unprocessed sepsis dataset</span></font></h3>\n",
    "<br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c09b7341",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_set = pd.read_csv(\"sepsis_dataset.csv\", encoding='utf-8', sep=';')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a546c8e4",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "data_set.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "235c2206",
   "metadata": {},
   "outputs": [],
   "source": [
    "n = len(pd.unique(data_set['cod_paciente']))\n",
    "\n",
    "print(\"No. patients:\",\n",
    "      n)\n",
    "\n",
    "n = len(pd.unique(data_set['seq_atendimento']))\n",
    "\n",
    "print(\"No. hospitalizations:\",\n",
    "      n)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7da3578",
   "metadata": {},
   "source": [
    "<a id='section02'></a>\n",
    "<h3 align=\"left\"><span style=\"color:#44546A\">2. Health record text processing</span></font></h3>\n",
    "<br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30cd7b37",
   "metadata": {},
   "outputs": [],
   "source": [
    "def tratar_texto(texto):\n",
    "        texto = texto.lower()\n",
    "        texto = re.sub(r'#', '', texto)\n",
    "        texto = re.sub(r'\\+', '', texto)\n",
    "        texto = re.sub(r'>+', '', texto)\n",
    "        texto = re.sub(r'-+', '', texto)\n",
    "        texto = re.sub(r'=+', '', texto)\n",
    "        texto = re.sub(r'\\*+', '', texto)\n",
    "        texto = re.sub(r'\\~+', '', texto)\n",
    "        texto = re.sub(r'>', '', texto)\n",
    "        texto = re.sub(r'\\(', '', texto)\n",
    "        texto = re.sub(r'\\)', '', texto)\n",
    "        texto = re.sub(r'\\//', '', texto)\n",
    "        texto = re.sub(r'>>', '', texto)\n",
    "        texto = re.sub(r'xx{2,}', '', texto)\n",
    "        texto = re.sub(r'\"', '', texto)\n",
    "        texto = re.sub(r'\\.{2,}', '.', texto)\n",
    "        texto = re.sub(r'\\|', ' ', texto)\n",
    "        texto = re.sub(r'\\s+', ' ', texto)\n",
    "        return texto"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "93b41de0",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_set['ctu_informacao_ttdo'] = data_set['ctu_informacao'].apply(lambda x: tratar_texto(x))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a55c39e",
   "metadata": {},
   "source": [
    "<a id='section02'></a>\n",
    "<h3 align=\"left\"><span style=\"color:#44546A\">2. Extract diagnosis from dataset</span></font></h3>\n",
    "<br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a19f5bfc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# To guarantee patient confidentiality, this procedure was performed manually by a doctor.\n",
    "# This was done during the final evaluation of the dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ebd2e163",
   "metadata": {},
   "source": [
    "<a id='section03'></a>\n",
    "<h3 align=\"left\"><span style=\"color:#44546A\">3. De-identify the dataset</span></font></h3>\n",
    "<br>\n",
    "two approaches - non-supervised and supervised followed (at the end) by a human analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b50ee248",
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "# non-supervised\n",
    "from gliner import GLiNER\n",
    "model = GLiNER.from_pretrained(\"urchade/gliner_multi_pii-v1\") # for NER identification of personally identifiable information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "925bee5e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# functions for de-identification\n",
    "\n",
    "def predict_entities(texto):\n",
    "    entities_list = model.predict_entities(texto, labels_g1, threshold=0.5)\n",
    "    df = pd.DataFrame(entities_list)\n",
    "    return df\n",
    "\n",
    "def replace_with_labels(text, entities_df):\n",
    "    for _, row in entities_df.iterrows():\n",
    "        text = text.replace(row['text'], f\"[{row['label']}]\")\n",
    "    return text\n",
    "\n",
    "def desidentificar_texto(model, text, labels):\n",
    "    entities_df = predict_entities(text)\n",
    "    return replace_with_labels(text, entities_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7aeae804",
   "metadata": {},
   "outputs": [],
   "source": [
    "# GLiNER processes a maximum of 348 tokens. For larger texts, it's necessary to partition the text into chunks.\n",
    "# Null ctu_informacao (there is none) values are not included in the chunks.\n",
    "# It's important to account for the space that arises between words and punctuation (.,:) due to the tokenizer.\n",
    "\n",
    "from nltk.tokenize import word_tokenize\n",
    "from typing import List\n",
    "\n",
    "tamanho_chunk = 150\n",
    "\n",
    "# explit text funcion\n",
    "\n",
    "def split_text_into_chunks(text: str, chunk_size: int) -> List[str]:\n",
    "    tokens = word_tokenize(text)\n",
    "    chunks = [' '.join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]\n",
    "    # remove extra spaces\n",
    "    chunks = [re.sub(r'\\s([.,:?%])', r'\\1', chunk) for chunk in chunks]  \n",
    "    return chunks\n",
    "\n",
    "new_rows = []\n",
    "for _, row in data_set.iterrows():\n",
    "    chunks = split_text_into_chunks(row['ctu_informacao_ttdo'], chunk_size=tamanho_chunk)\n",
    "    for chunk in chunks:\n",
    "        new_row = row.copy()\n",
    "        new_row['ctu_informacao_ck'] = chunk # splited text\n",
    "        new_rows.append(new_row)\n",
    "\n",
    "# the original data will be repeted\n",
    "\n",
    "caso_chunk = pd.DataFrame(new_rows)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4384b4f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# labels for NER identification\n",
    "\n",
    "labels_g1 = ['person', 'organization', 'address', 'postal code', 'phone number', 'email', 'username',\n",
    "             'mobile phone number', 'identity card number', 'national id number','registration number',\n",
    "             'student id number','digital signature', 'social media handle','license plate number', 'serial number',\n",
    "             'fax number', 'identity document number','social_security_number']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b9116120",
   "metadata": {},
   "outputs": [],
   "source": [
    "tqdm.pandas()\n",
    "caso_chunk['desidentificada'] = caso_chunk['ctu_informacao_ck'].progress_apply(\n",
    "    lambda text: desidentificar_texto(model, text, labels_g1)\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6715b7e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "## join the chunks\n",
    "\n",
    "dataset_desidentificado = caso_chunk.groupby(['cod_paciente', 'seq_atendimento','ctu_informacao'])['desidentificada'].apply(' '.join).reset_index() \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d0967128",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_desidentificado.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "62338e0c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# supervised\n",
    "\n",
    "import spacy\n",
    "nlp = spacy.load('pep_anotado_p94r91/model-best')\n",
    "print(nlp.pipe_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b9cfcd0e",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_desidentificado['desidentificada_sup']=dataset_desidentificado['desidentificada'].astype('str').apply(nlp)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f7e0a9a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# extract the labels for manual validation\n",
    "\n",
    "seq_atendimento = []\n",
    "textos = []\n",
    "label = []\n",
    "\n",
    "for idx, doc in enumerate(dataset_desidentificado['desidentificada_sup']):\n",
    "    for ent in doc.ents:  \n",
    "        textos.append(ent.text)\n",
    "        label.append(ent.label_)\n",
    "        seq_atendimento.append(dataset_desidentificado['seq_atendimento'].iloc[idx])\n",
    "\n",
    "validar_desident_sup = pd.DataFrame({'seq_atendimento': seq_atendimento, \n",
    "                                   'texto': textos,\n",
    "                                   'label': label})\n",
    "\n",
    "validar_desident_sup = validar_desident_sup.groupby('seq_atendimento').apply(lambda x: x.drop_duplicates(subset=['texto'])).reset_index(drop=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a06a07a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_desidentificado.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e33c5d86",
   "metadata": {},
   "outputs": [],
   "source": [
    "def desidentificar_anos(texto):\n",
    "    datas = re.findall(r'\\b\\d{2}/\\d{2}/\\d{4}\\b', texto)\n",
    "    datas_datetime = pd.to_datetime(datas, format='%d/%m/%Y', errors='coerce')\n",
    "    datas_datetime = datas_datetime.dropna()\n",
    "    datas_datetime = pd.Series(datas_datetime)  \n",
    "    anos_originais = sorted(datas_datetime.dt.year.unique())\n",
    "    anos_substitutos = {ano: 2020 + i for i, ano in enumerate(anos_originais)}\n",
    "    datas_datetime = datas_datetime.apply(lambda x: x.replace(year=anos_substitutos[x.year]))\n",
    "    datas_string = datas_datetime.dt.strftime('%d/%m/%Y').tolist()\n",
    "    for data_original, data_modificada in zip(datas, datas_string):\n",
    "        texto = texto.replace(data_original, data_modificada)\n",
    "\n",
    "    return texto"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "56ea5dc4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Aplica a função para desidentificar os anos na coluna 'texto'\n",
    "dataset_desidentificado['desidentificada'] = dataset_desidentificado['desidentificada'].apply(desidentificar_anos)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8d3d6cf",
   "metadata": {},
   "source": [
    "<a id='section04'></a>\n",
    "<h3 align=\"left\"><span style=\"color:#44546A\">4. Process Medical abreviations</span></font></h3>\n",
    "<br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "950f9493",
   "metadata": {},
   "outputs": [],
   "source": [
    "abreviacao = pd.read_csv('abreviaturas.csv', sep=';')\n",
    "\n",
    "def substituir_abreviacoes(texto, abreviacoes):\n",
    "    for _, row in abreviacoes.iterrows():\n",
    "        texto_resumido = row['texto_resumido']\n",
    "        texto_expandido = row['texto_expandido']\n",
    "        # Usando regex para substituir apenas palavras completas\n",
    "        texto = re.sub(r'\\b{}\\b'.format(re.escape(texto_resumido)), texto_expandido, texto)\n",
    "    return texto"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "981aba24",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_desidentificado['desidentificada_expand'] = dataset_desidentificado['desidentificada'].apply(lambda x: substituir_abreviacoes(x, abreviacao))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "522decfe",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_desidentificado.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d361ea51",
   "metadata": {},
   "source": [
    "<a id='section05'></a>\n",
    "<h3 align=\"left\"><span style=\"color:#44546A\">5. Incorporating complementary information into the dataset</span></font></h3>\n",
    "<br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed532807",
   "metadata": {},
   "outputs": [],
   "source": [
    "information = pd.read_csv('pep_sepsis_selecionado_parametro.csv', sep=';')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "65d85785",
   "metadata": {},
   "outputs": [],
   "source": [
    "information.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3990b4b2",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_desidentificado = pd.merge(dataset_desidentificado, information[['seq_atendimento', 'tempo_internacao_final']], on='seq_atendimento', how='left')\n",
    "dataset_desidentificado = pd.merge(dataset_desidentificado, information[['seq_atendimento', 'num_especialidades']], on='seq_atendimento', how='left')\n",
    "dataset_desidentificado = pd.merge(dataset_desidentificado, information[['seq_atendimento', 'uti']], on='seq_atendimento', how='left')\n",
    "dataset_desidentificado = pd.merge(dataset_desidentificado, information[['seq_atendimento', 'paliativos']], on='seq_atendimento', how='left')\n",
    "dataset_desidentificado = pd.merge(dataset_desidentificado, information[['seq_atendimento', 'tipo_alta']], on='seq_atendimento', how='left')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "09a4d698",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_desidentificado.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1b74ab7b",
   "metadata": {},
   "source": [
    "<a id='section06'></a>\n",
    "<h3 align=\"left\"><span style=\"color:#44546A\">6. Descriptive statistics</span></font></h3>\n",
    "<br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4864fc46",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_desidentificado[\"ctu_informacao_len\"] = dataset_desidentificado[\"ctu_informacao\"].apply(lambda x : len(str(x).split()))\n",
    "dataset_desidentificado[\"desidentificada_len\"] = dataset_desidentificado[\"desidentificada\"].apply(lambda x : len(str(x).split()))\n",
    "dataset_desidentificado[\"desidentificada_expand_len\"] = dataset_desidentificado[\"desidentificada_expand\"].apply(lambda x : len(str(x).split()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "061c65cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "estatisticas_descritivas = dataset_desidentificado[['ctu_informacao_len', 'desidentificada_len', 'desidentificada_expand_len']].describe()\n",
    "print(estatisticas_descritivas)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6e41211b",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = dataset_desidentificado[['ctu_informacao_len', 'desidentificada_len', 'desidentificada_expand_len']]\n",
    "\n",
    "data_long = pd.melt(data, var_name='medida', value_name='tamanho')\n",
    "\n",
    "# Cria o gráfico box-plot deitado com Seaborn\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.boxplot(x='tamanho', y='medida', data=data_long, orient='h', hue='medida', palette=\"vlag\", legend=False) \n",
    "\n",
    "plt.title('Box-plot of Text Lengh')\n",
    "plt.xlabel('Text Lengh')\n",
    "plt.ylabel('Texts')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a941c3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_long = pd.melt(dataset_desidentificado, \n",
    "                    value_vars=['ctu_informacao_len', 'desidentificada_len', 'desidentificada_expand_len'], \n",
    "                    var_name='measure', value_name='tamanho')\n",
    "\n",
    "# Cria o gráfico usando Seaborn\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.histplot(data=data_long, x='tamanho', hue='measure', \n",
    "             palette=\"icefire\", multiple=\"dodge\", \n",
    "             edgecolor='black')\n",
    "\n",
    "# Configura o gráfico\n",
    "plt.title('Histogram of text lenght')\n",
    "plt.xlabel('Text length')\n",
    "plt.ylabel('Number of records')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "71cb8900",
   "metadata": {},
   "outputs": [],
   "source": [
    "estatisticas_descritivas_2 = dataset_desidentificado[['tempo_internacao_final', 'num_especialidades']].describe()\n",
    "print(estatisticas_descritivas_2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29f20205",
   "metadata": {},
   "outputs": [],
   "source": [
    "# without missing values (99999)\n",
    "dataset_filtrado = dataset_desidentificado.loc[dataset_desidentificado['tempo_internacao_final'] != 99999]\n",
    "estatisticas_descritivas_2 = dataset_filtrado[['tempo_internacao_final', 'num_especialidades']].describe()\n",
    "print(estatisticas_descritivas_2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a736e690",
   "metadata": {},
   "outputs": [],
   "source": [
    "estatisticas_descritivas_2 = dataset_desidentificado[['tempo_internacao_final', 'num_especialidades']].describe()\n",
    "print(estatisticas_descritivas_2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc611c47",
   "metadata": {},
   "outputs": [],
   "source": [
    "contagem_categorias = dataset_desidentificado['uti'].value_counts()\n",
    "print(\"Intensive care unit hospitalization: 1.0 = ICU:\\n\", contagem_categorias)\n",
    "print()\n",
    "contagem_categorias = dataset_desidentificado['paliativos'].value_counts()\n",
    "print(\"Palliative care patients: 1.0 = Palliative care:\\n\", contagem_categorias)\n",
    "print()\n",
    "contagem_categorias = dataset_desidentificado['tipo_alta'].value_counts()\n",
    "print(\"Deaths: óbito = death, 0 - not classified/missing, não obito = transfer or discharge :\\n\", contagem_categorias)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e2894884",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_desidentificado.to_csv('validar/dataset_desidentificado.csv',sep=';')\n",
    "validar_desident_sup.to_csv('validar/desident_sup.csv', sep=';')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e66f73e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# replace the cod_paciente with a sequential number\n",
    "new_cod_paciente = range(1, len(data_set) + 1)\n",
    "data_set.insert(0, 'new_cod_paciente', new_cod_paciente)\n",
    "data_set = data_set.drop('cod_paciente', axis=1)\n",
    "data_set = data_set.drop('seq_atendimento', axis=1)\n",
    "data_set = data_set.drop('profissao', axis=1)\n",
    "data_set = data_set.drop('indice', axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "556679e8",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "dataset_desidentificado"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
